[DRAFT] Personal Auto Pricing

A tutorial for P&C pricing using R.

Author (Affiliation) , Author (Affiliation)
2018-12-18

Table of Contents


Advances in machine learning and open source technologies are having a profound impact on many industries, and insurance is no exception. In the US, both the Society of Actuaries and CAS Institute list knowledge of R as a requirement in exam syllabi (Society of Actuaries 2018; CAS Institute 2018). In a draft white paper on regulatory review of predictive models, the National Association of Insurance Commisioners (NAIC) recognizes that actuaries are using techniques such as random forests and neural networks in predictive modeling projects (Casualty Actuarial and Statistical (C) Task Force 2010). However, while much has been written on modeling methodology for pricing, there have been few freely available resources that include reproducible code; one such example is (Alfredo 2011). To the best of our knowledge, prior to the current article, no tutorials exist that show the end-to-end pricing workflow.

In this tutorial, we walk through an example of a pricing project using R, from preparing raw data for modeling to filing. Throughout the process, we provide reproducible code for each analysis task so the reader can follow along. We also attempt to adhere to the latest regulatory guidelines, so actuaries can better adopt the workflow presented in their work.

Data

In this tutorial, we use the publicly available data from Brazil’s Superintendence of Private Insurance (SUSEP). It maintains policy-characteristics-level data for personal auto from 2007 to 2014 for all insured vehicles in Brazil. This data source was chosen because it contains a variety of variables, from policyholder characteristics to losses by peril, to support a nontrivial analysis, and contains a variety of mapping tables so we can demonstrate a broader range of data manipulation work found in a typical pricing assignment. You can access the data from SUSEP’s AUTOSEG website.

(Talk about using 2013 data and validation scheme, see issues/41)

Data details

With the help of documentation accompanying the datasets, we compile a list of tables shown below.

(these need to be in better English see issues/42)

File Description
arq_casco Exposure data, premiums, claims and insured amount for the CASCO overhead, classified by the Key Category Rate / Region / Model / Year / Sex / Age Range
arq_casco3 Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / CEP / Model / Year key
arq_casco4 Exposure data, premiums and claims for the CASCO overhang, classified by the Key Rate Category / City / Model / Year
premreg Regional distribution of prices
sinreg Regional distribution of claims
File Description
auto2_vei FIPE code and description of each vehicle model, in addition to the group code to which it belongs
auto2_group Code and description of model groups it contains
auto_cat Description code of tariff categories
auto_cau Code and description of causes of accidents
auto_cep Correlates the CEP with cities and regions of circulation
auto_cob Code and description of covers
auto_idade Code and description of age groups
auto_reg Code and description of regions of circulation
auto_sexo Code and description of sex (male, female, legal)
auto_city Code and name of cities

The excerpts for each of the source tables, before any transformations, can be found at Raw Table Excerpts.

We note that, for the purpose of this exercise, we’ll use the arq_casco policy table since it is the only one that contains poilcyholder characteristics (sex and age.) We are unable to utilize arq_casco3 and arq_casco4, which contain more granular location information, since there are no keys to join by; presumably the data is presented this way to preserve privacy.

Data prep

Because the source data is in Portuguese, the first order of business is to obtain translations of the column names so we can all understand them. TODO: dictionary

Once that’s done, we’ll take an initial look at the tables:

arq_casco

Skim summary statistics
n obs: 2416213
n variables: 22

Variable type: character

variable missing complete n min max empty n_unique
age_code 0 2416213 2416213 1 1 0 6
data_year 0 2416213 2416213 5 5 0 1
region_code 0 2416213 2416213 2 2 0 43
sex_code 0 2416213 2416213 1 1 0 4
vehicle_category_code 0 2416213 2416213 1 1 0 8
vehicle_code 0 2416213 2416213 8 8 0 5364

Variable type: integer

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
vehicle_year 12 2416201 2416213 2006.91 5.16 1935 2005 2008 2011 2020 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2586>

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
average_insured_amount 0 2416213 2416213 41619.57 40002.72 0 20876.66 31596.81 50560.36 9593245.26 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_collision_partial 0 2416213 2416213 841.11 5979.58 0 0 0 0 2351086 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_collision_total_loss 0 2416213 2416213 542.3 6094.13 0 0 0 0 2300123 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_fire 0 2416213 2416213 16.81 1089.48 0 0 0 0 459690 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_other 0 2416213 2416213 233.82 3886.65 0 0 0 0 1307819 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_theft 0 2416213 2416213 686.06 8070.97 0 0 0 0 2056463 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_collision_partial 0 2416213 2416213 0.2 4.59 0 0 0 0 4963 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_collision_total_loss 0 2416213 2416213 0.016 0.14 0 0 0 0 11 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_fire 0 2416213 2416213 0.00069 0.027 0 0 0 0 3 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_other 0 2416213 2416213 0.49 4.64 0 0 0 0 1193 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_theft 0 2416213 2416213 0.026 0.26 0 0 0 0 82 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
EXPOSICAO2 0 2416213 2416213 0 0 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
exposure 0 2416213 2416213 2.96 13.71 0 0.47 0.56 1.76 5420.13 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
PREMIO2 0 2416213 2416213 0 0 0 0 0 0 0 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
premium 0 2416213 2416213 3375.05 17282.63 0 377.64 886.77 2330.22 1e+07 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>

auto_cat

Skim summary statistics
n obs: 8
n variables: 2

Variable type: character

variable missing complete n min max empty n_unique
vehicle_category 0 8 8 6 39 0 8
vehicle_category_code 0 8 8 1 1 0 8

auto_reg

Skim summary statistics
n obs: 41
n variables: 2

Variable type: character

variable missing complete n min max empty n_unique
region 0 41 41 8 47 0 41
region_code 0 41 41 2 2 0 41

auto2_vei

Skim summary statistics
n obs: 4756
n variables: 4

Variable type: character

variable missing complete n min max empty n_unique
vehicle_code 0 4756 4756 8 8 0 4756
vehicle_description 0 4756 4756 6 58 0 4756
vehicle_group 0 4756 4756 3 33 0 476
vehicle_group_code 0 4756 4756 1 3 0 476

auto_idade

Skim summary statistics
n obs: 6
n variables: 2

Variable type: character

variable missing complete n min max empty n_unique
age_code 0 6 6 1 1 0 6
age_range 0 6 6 20 20 0 6

auto_cau

Skim summary statistics
n obs: 5
n variables: 2

Variable type: character

variable missing complete n min max empty n_unique
peril 0 5 5 6 19 0 5
peril_code 0 5 5 1 1 0 5

auto2_grupo

Skim summary statistics
n obs: 476
n variables: 2

Variable type: character

variable missing complete n min max empty n_unique
vehicle_group_code 0 476 476 1 3 0 476
vehicle_group_description 0 476 476 3 33 0 476

For the policy table, with the exception of the numeric columns, all of the columns contain coded values which need to be mapped. By referencing the data documentation, we observe the data model diagram below TODO: actually do this.

(need more details here issues/44)

We then perform a series of joins and extract the relevant columns to create a combined data frame, then inspect the result for reasonableness:

Skim summary statistics
n obs: 2416213
n variables: 22

Variable type: character

variable missing complete n min max empty n_unique
age_range 0 2416213 2416213 13 18 0 6
data_year 0 2416213 2416213 5 5 0 1
region 5329 2410884 2416213 8 47 0 41
sex 0 2416213 2416213 8 14 0 4
vehicle_category 0 2416213 2416213 6 39 0 8
vehicle_description 368687 2047526 2416213 6 58 0 4298
vehicle_group 368687 2047526 2416213 3 33 0 445
vehicle_group_description 368687 2047526 2416213 3 33 0 445

Variable type: integer

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
vehicle_year 12 2416201 2416213 2006.91 5.16 1935 2005 2008 2011 2020 <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587><U+2586>

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
average_insured_amount 0 2416213 2416213 41619.57 40002.72 0 20876.66 31596.81 50560.36 9593245.26 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_collision_partial 0 2416213 2416213 841.11 5979.58 0 0 0 0 2351086 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_collision_total_loss 0 2416213 2416213 542.3 6094.13 0 0 0 0 2300123 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_fire 0 2416213 2416213 16.81 1089.48 0 0 0 0 459690 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_other 0 2416213 2416213 233.82 3886.65 0 0 0 0 1307819 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_amount_theft 0 2416213 2416213 686.06 8070.97 0 0 0 0 2056463 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_collision_partial 0 2416213 2416213 0.2 4.59 0 0 0 0 4963 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_collision_total_loss 0 2416213 2416213 0.016 0.14 0 0 0 0 11 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_fire 0 2416213 2416213 0.00069 0.027 0 0 0 0 3 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_other 0 2416213 2416213 0.49 4.64 0 0 0 0 1193 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
claim_count_theft 0 2416213 2416213 0.026 0.26 0 0 0 0 82 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
exposure 0 2416213 2416213 2.96 13.71 0 0.47 0.56 1.76 5420.13 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
premium 0 2416213 2416213 3375.05 17282.63 0 377.64 886.77 2330.22 1e+07 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>

(talk about missing values)

EDA

Distributions “One-way” analysis

Exposures by state

(something like this, will need to add tooltip and cut out Brazil)

Modeling

Variable selection, modeling, etc.

Rating plan

Transfer results to plan, dislocation analysis, etc.

Alfredo, Spedicato Giorgio. 2011. “Third Party Motor Liability Ratemaking with R.” Casualty Actuarial Society. https://www.casact.org/research/wp/papers/working-paper-spedicato-%202012-06.pdf.

Casualty Actuarial and Statistical (C) Task Force. 2010. “Regulatory Review of Predictive Models Draft.” National Association of Insurance Commissioners. https://www.naic.org/documents/cmte_c_catf_exposure_predictive_model_white_paper.pdf.

Society of Actuaries. 2018. “Exam Pa: Predictive Analytics.” 2018. https://www.soa.org/Education/Exam-Req/edu-exam-pa-detail.aspx.